Reinforcement learning occurs when
organisms adapt behavior on the basis
of associations with reward and punishment.
Reinforcement learning is a useful
algorithm because it is unsupervised, 
relying on trial-and-error learning under 
conditions in which the optimal solution 
is unknown. Recent neural network mod- 
els of reinforcement learning are based on 
the neurophysiology of the rat, monkey, 
and human dopamine systems (Montague 
et al., 1996; Dayan and Balleine, 2002; 
Schultz, 2002; Montague et al., 2004;
Pan et al., 2008). The main finding of
this research is that the dopamine system 
appears to minimize errors in the predic- 
tion of reward through a process called 
temporal difference learning. As predicted 
by the temporal difference learning mod- 
els, dopamine neurons respond during the 
early stages of classical and operant con- 
ditioning with a burst of action potentials 
(a phasic-like response) after reward presentation
(Schultz, 1998; O'Doherty et al.,
2006). However, after repeated pairings of 
a given stimulus and reinforcement, the 
dopamine neurons respond to the onset 
of the stimulus, be it a conditioned stim- 
ulus or a cue that triggers a stereotyped 
action that results in reward (Mirenowicz 
and Schultz, 1994). After an association 
has been formed between the stimulus and 
reinforcement, dopamine ceases respond- 
ing to the reinforcer itself (Schultz et al., 
1997). 
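
The shift of the prediction error from reward to cue can be illustrated with a minimal temporal difference simulation. The sketch below is not the model of any study cited above. It assumes a single cue, a fixed cue-reward interval, one learned value per time step after cue onset, and illustrative parameter names and values (`alpha`, `gamma`, `cs_step`, `reward_step`).

```python
def simulate_td(n_trials=500, n_steps=10, cs_step=2, reward_step=7,
                alpha=0.1, gamma=1.0):
    """Return the TD error at every time step of every trial.

    Assumptions (illustrative only): cue onset is unpredictable, so the
    value before the cue is fixed at 0; w[k] is the learned value k steps
    after cue onset; a reward of 1.0 arrives at `reward_step`.
    """
    w = [0.0] * (n_steps - cs_step)

    def value(t):
        # States before cue onset carry no prediction.
        return w[t - cs_step] if cs_step <= t < n_steps else 0.0

    history = []
    for _ in range(n_trials):
        deltas = [0.0] * n_steps
        for t in range(n_steps - 1):
            r = 1.0 if t + 1 == reward_step else 0.0
            # Prediction error for the transition from step t to step t + 1.
            delta = r + gamma * value(t + 1) - value(t)
            deltas[t + 1] = delta
            if t >= cs_step:
                w[t - cs_step] += alpha * delta
        history.append(deltas)
    return history


history = simulate_td()
first, last = history[0], history[-1]
print("first trial: cue", first[2], "reward", first[7])
print("last trial:  cue", round(last[2], 3), "reward", round(last[7], 3))
```

In this sketch the error on the first trial occurs at reward delivery, whereas after training it occurs at cue onset and is close to zero at the time of the now-predicted reward, which is the qualitative pattern reported for dopamine neurons.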

Based on these neurophysiological 
data, reinforcement learning models have 
proposed that the role of the midbrain 
phasic DA neurons is to act as a teach- 
ing signal which adjusts reward prediction 
errors and broadcasts such information 
to upstream cell populations involved
in reward learning such as the nucleus
accumbens (NAc) (Joel et al., 2002;
Wassum et al., 2013). More recently, a
number of computational studies have 
added another layer of complexity to their 
models by incorporating the idea of incen- 
tive motivation as a way to better capture 
the role of dopamine in reward learning 
(McClure et al., 2003; Niv, 2007; Zhang
et al., 2009; Morita et al., 2013). This has
largely been based on findings from lesion 
and pharmacological studies whereby it 
has been hypothesized that dopamine 
neurons respond to conditioned stimuli by 
invigorating instrumental actions that lead 
to the obtainment of rewards (Berridge
et al., 2009; Wassum et al., 2011).

In the meantime, a number of authors
have suggested that, because midbrain
dopamine neurons also respond to aversive
and salient stimuli with phasic activations
(Matsumoto and Hikosaka, 2009;
Cohen et al., 2012; Ilango et al., 2012;
Tan et al., 2012; Brooks and Berns, 2013;
Fiorillo et al., 2013), their role in
encoding reward prediction errors may be
more limited than first envisaged (Horvitz
et al., 1997; Redgrave and Gurney, 2006;
Redgrave et al., 2008; May et al., 2009;
Thirkettle et al., 2013). The scope of this 
Opinion article, however, is not to assess 
the validity of such claims. 

Rather, the aim of this article
is to focus on one area of research
that has received relatively little atten- 
tion, namely, how the phasic DA signal 
may be causally related to action selec- 
tion, goal-directed behavior, and behav- 
ioral flexibility. This is partially because 
the vast majority of studies which have 
explored whether DA neurons may encode 
more than reward prediction errors (e.g.,
measures related to behavioral
flexibility such as reward value, reward
probability, choice behavior, discounting
of delayed rewards) (Fiorillo et al., 2003,
2008; Morris et al., 2006; Roesch et al.,
2007; Takahashi et al., 2009; Bromberg-Martin
et al., 2010a,b; Nomoto et al.,
2010) have been based upon electrophysiological
data, which by their very
nature can only support a correlation
between neuronal activation or inhibition
and behavior but cannot establish
causation. This has been acknowledged 
by a statement from Wolfram Schultz 
who declared that "although the predic- 
tion error response of dopamine neurons 
would make a good teaching signal, the 
bulk of the available data are correlational" 
(Schultz, 2010). Therefore, to address
causation we will look at a number of
recent studies that have primarily used
optogenetic, voltammetric, and pharmacological
interventions and that may provide
an answer to this question.

With the recent introduction of 
optogenetics, for example, it has been 
possible to perturb neural activity at mil- 
lisecond timescales and directly relate this 
manipulation to an array of behaviors 
including sleep, anxiety, depression, and 
fear, to name but a few (Rolls et al., 2011; 
Kim et al., 2013; Tye et al., 2013; Courtin
et al., 2014). More specifically, midbrain 
DA neurons and their striatal projec- 
tions have also been selectively targeted 
resulting in behavioral modifications of 
food intake, cocaine consumption, condi- 
tioned place preference and aversion (by 
inhibition of DA activity via GABAergic 
VTA cells) (Tsai et al., 2009; Lobo et al.,
2010; Domingos et al., 2011; Tan et al.,
2012).

Optogenetic targeting of midbrain DA
cells and their striatal projections has also
revealed interesting observations regard- 
ing their causal role in reward prediction, 
and possibly, behavioral flexibility. With
regard to the causal role of DA in reward
prediction, Kim et al. (2012)
showed that phasic activation of VTA DA
neurons after a nose poke could drive 
operant responses in the absence of food 
reward. In another laboratory, a blocking 
procedure was used to demonstrate that 
activation of DA neurons at the time of 
reward delivery during compound stimu- 
lus presentation could artificially produce 
a conditioned response to the normally 
blocked cue. In other words, phasic DA 
stimulation at a point in time (reward 
delivery) when this would normally be 
absent could unblock learning (Steinberg
et al., 2013).
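
The logic of blocking and unblocking can be sketched with a simple error-driven (Rescorla-Wagner style) learning rule. This is an illustrative simplification rather than the analysis used in the study above: it assumes that optogenetic stimulation acts as a constant added to the prediction error at reward delivery (the hypothetical `stim` parameter), and the cue labels `A` (pretrained cue) and `X` (added cue) are placeholders.

```python
def blocking_experiment(stim=0.0, alpha=0.2, n_trials=100):
    """Associative strengths after pretraining (A) and compound training (AX).

    `stim` is a hypothetical additive boost to the prediction error during
    compound training, standing in for phasic DA stimulation.
    """
    v = {"A": 0.0, "X": 0.0}

    # Phase 1: cue A alone is paired with a reward of 1.0.
    for _ in range(n_trials):
        delta = 1.0 - v["A"]
        v["A"] += alpha * delta

    # Phase 2: compound AX is paired with the same reward.
    for _ in range(n_trials):
        delta = 1.0 - (v["A"] + v["X"]) + stim
        v["A"] += alpha * delta
        v["X"] += alpha * delta

    return v


print("no stimulation:  ", blocking_experiment(stim=0.0))
print("with stimulation:", blocking_experiment(stim=0.5))
```

Without the added term, cue A already predicts the reward, the error during compound training stays near zero, and X acquires almost no associative strength (blocking). With the added term, a positive error is present at reward delivery and X acquires strength, which corresponds to the unblocking effect described above.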

In a separate study examining how manipulation
of the GABAergic cells of the
VTA affects reward learning and
DA release, optogenetic stimulation of
VTA GABAergic neurons disrupted consummatory
behavior, but not when the VTA
GABA projections to the NAc were targeted.
Moreover, stimulation of the GABA
neurons suppressed VTA DA firing and
release in the NAc (Van Zessen et al.,
2012). In a further study to characterize 
the VTA GABA projections to the NAc, 
it was found that activation of this path- 
way selectively inhibited cholinergic neu- 
rons of the NAc which in turn increased 
associative learning of an aversive predic- 
tive cue (Brown et al., 2012). Importantly, 
this effect was dopamine independent, as 
stimulation of GABA terminals in the NAc 
did not change baseline firing of VTA DA 
cells. Taken together, these studies confirm 
that within the VTA, DA activity regulates 
aspects related to appetitive reward learn- 
ing. Moreover, these data highlight how 
the encoding of an aversive outcome may 
not only be signaled by DA cells project- 
ing to the NAc but also by activation of 
cholinergic cells in the NAc that receive 
preferential input from VTA GABA neu- 
rons, extending the results from previous 
investigations (Tan et al., 2012). 

With regard to the causal role of DA
in behavioral flexibility, Adamantidis
et al. (2011)
targeted the dopaminergic neurons of
the VTA by injecting channelrhodopsin-2
(ChR2) in Th-Cre mice. The initial behavioral
paradigm required mice to bar press
one of two levers. The "active" lever 
resulted in food delivery plus optogenetic 
stimulation whereas bar pressing on the 
"inactive" lever resulted in the delivery 
of food only. Compared to controls (YFP 
mice), phasic DA stimulation enhanced 
the effects of food-reward seeking (i.e., 
mice bar pressed the active lever preferen- 
tially over the inactive). Interestingly, they 
also found that after a series of extinc- 
tion sessions during which no food reward 
or phasic DA stimulation occurred, pref- 
erential lever pressing (to the initial active 
lever) could be reestablished by DA stim- 
ulation in the absence of both external 
cues and, critically, food reward. Finally, 
the authors used a reversal learning session 
where the relationship between the active 
(optical stimulation + no food reward) 
and inactive (no optical stimulation + no 
food reward) levers was switched, and
demonstrated that ChR2 mice switched 
their lever pressing to the previously inac- 
tive lever compared to control mice. This 
finding is particularly important because 
it suggests that not only is the phasic 
DA signal driving and enhancing simple
stimulus-reward associations but it is
also causally involved in flexible behav- 
ioral adaptations that occur as a result of 
changes in stimulus-reward contingencies. 

Behavioral flexibility has also been 
tested by optogenetic manipulations of 
dopamine-receiving NAc neurons. In a
recent study, dopamine D1 and D2 receptors
were selectively targeted while D1-cre
and D2-cre mice were performing a probabilistic
switching task (Tai et al., 2012).
The results showed that activation of D1
and D2 neurons was effective at increasing
lose-shift behavior (i.e., moving from
an incorrect to a correct response) compared
to controls but had no effect on
win-stay performance (i.e., repeating the
previously rewarded response). Moreover,
the effect was present when stimulation
occurred before movement initiation
but not when it was delayed by 150 ms.
Interestingly, we recently found (Aquili
et al., 2014) that non-specific optogenetic
inhibition, and not excitation, of NAc shell
neurons increased lose-shift behavior, but
only if the inhibition occurred during
feedback of results (between lever pressing
and rewards or non-rewards) and not
during action selection (preceding a lever
press). We speculated that inhibition of 
NAc cells during specific time segments 
may have weakened reward expectancy 
signals which would in turn facilitate 
switching to a correct response after an 
error. 
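
For readers less familiar with these measures, the sketch below shows one common way of computing win-stay and lose-shift proportions from a sequence of choices and outcomes. It is a generic trial-pair definition assumed here for illustration, not necessarily the exact scoring used in the studies above; the function name and the example data are hypothetical.

```python
def win_stay_lose_shift(choices, rewards):
    """Return (p_win_stay, p_lose_shift) from trial-by-trial data.

    choices: sequence of option labels, one per trial.
    rewards: sequence of truthy/falsy outcomes, one per trial.
    A proportion is None when no trial of that type occurred.
    """
    wins = win_stays = losses = lose_shifts = 0
    # Pair each trial's choice and outcome with the choice on the next trial.
    for prev_choice, prev_reward, next_choice in zip(choices, rewards, choices[1:]):
        if prev_reward:
            wins += 1
            win_stays += next_choice == prev_choice
        else:
            losses += 1
            lose_shifts += next_choice != prev_choice
    p_win_stay = win_stays / wins if wins else None
    p_lose_shift = lose_shifts / losses if losses else None
    return p_win_stay, p_lose_shift


# Made-up example: five trials on a two-lever task.
choices = ["L", "L", "R", "R", "R"]
rewards = [1, 0, 1, 0, 0]
print(win_stay_lose_shift(choices, rewards))
```

Under this definition, a manipulation that selectively increases lose-shift behavior raises the second proportion while leaving the first unchanged.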

Differential effects between NAc core 
and shell on learning have been observed 
using fast-scan cyclic voltammetry, which
may explain the contradictory findings
from the two previous optogenetic studies.
In fact, in one study, cue-evoked
dopamine release was larger and longer 
lasting in the NAc shell than in the core 
during goal-directed behavior for sucrose 
(Cacciapaglia et al., 2012). In two related 
studies, it was also found that concen- 
trations of cue-evoked DA release closely 
tracked differences in reward magnitude 
in the NAc shell (Beyene et al., 2010) and
reward delays in both NAc core and shell 
(Wanat et al., 2010). DA reward predic- 
tion error signals in the NAc core have also 
been reported using voltammetry (Hart 
et al., 2014). Here, using a probabilistic
decision-making task, the authors found 
that dopamine concentrations varied sys- 
tematically as differing degrees of reward 
uncertainty were introduced, in a man- 
ner closely resembling the predictions of 
reinforcement learning models and elec- 
trophysiological data of VTA DA neurons. 
Similarly, the observation that the DA pha- 
sic response to rewards gradually shifts 
to the earliest predictor of reinforcement 
over the course of learning as predicted 
by temporal difference models (Sutton and 
Barto, 1981) and validated by DA elec- 
trophysiological recordings, has been con- 
firmed by voltammetric data (Sunsay and 
Rebec, 2008). These findings are impor- 
tant because changes in firing rates may 
not always reflect changes in DA release 
(Youngren et al., 1993), and these voltam- 
metric data allow us to better establish the 
causal role of DA in reward learning. 

Data from pharmacological manipulation
of (mostly) dopamine D1 and D2
function in the striatum are another important
component to take into account
when trying to establish a causal link 
between neural activity and behavior. 
Dopamine depletion, for example, in the 
dorsomedial striatum results in rever- 
sal learning impairments (O'Neill and 
Brown, 2007). Moreover, in
stimulant-dependent individuals who display perseverative
behaviors following an incorrect
response during a reversal learning
task, administration of a dopamine D2/3 
antagonist reduced perseverative errors 
and improved caudate nucleus function 
(Ersche et al., 2011), and in a separate
study, administration of a D2 antagonist
enhanced reward-related prediction
error signals in the striatum (Jocham
et al., 2011). Conversely, stimulation of D2
(but not D1) receptors using the agonist
quinpirole impaired goal-directed behavior
and decision making (St Onge et al.,
2011; Naneix et al., 2013), and broad inactivation
of caudate nucleus cells disrupted
the ability for flexible responses based 
on previous reward history (Muranishi 
et al., 2011). Interestingly, in monkeys, D2 
receptor availability in the dorsal striatum 
was correlated with the number of reversal
learning errors (Groman et al., 2011).
Overall, these data suggest that abnormal 
increases/decreases in striatal DA activity
via D1/D2 receptors causally influence
several important measures of behavioral 
flexibility. 

Studies that have looked at increasing 
dopamine concentration have demon- 
strated that DA stimulation by injec- 
tion of amphetamine in the NAc core 
or shell increased instrumental respond- 
ing to a conditioned stimulus predictive of 
reward (Pecina and Berridge, 2013), and 
administration of the dopamine precursor 
L-DOPA in older adults restored reward
prediction error signaling (Chowdhury
et al., 2013).

In conclusion, increasing evidence from
optogenetic, voltammetric, and pharmacological
studies in recent years has
added a new dimension to the established
but mostly correlational link between
midbrain DA neurons and reward learning.
This evidence suggests that the phasic
DA response may have a causal role not
only in reward prediction error signal- 
ing, but also in driving flexible behavioral 
adaptations to changes in stimulus-reward 
contingencies. 